Application  of  Confidence  Intervals  to  Text-Based  Social  Network  Construction 


Julie  Paynter,  Ian  McCulloh,  John  Graham 
United  States  Military  Academy 
West  Point,  NY  10996 


Abstract 

With  the  increasing  importance  of  gathering  intelligence  on  insurgent  and 
terrorist  groups,  social  network  analysis  (SNA)  has  become  an  important  analytic 
tool.  SNA  is  the  mathematical  methodology  of  quantifying  connections  between 
individuals  and  groups.  This  research  is  focused  on  the  concept  of  centrality, 
which  is  a  mathematical  process  of  determining  which  node  in  a  network  is  the 
most  central,  or  connected.  Thematic,  or  intangible,  relationships  consist  of 
entities  that  are  not  directly  connected,  but  who  share  similar  ideologies.  While 
the  concept  of  centrality  -  the  most  connected  node  -  remains  the  same,  the 
question  becomes  how  to  determine  if  two  nodes  are  connected  where  a  tangible 
relationship  is  not  present.  To  determine  if  there  is  a  connection,  t-confidence 
intervals  are  constructed  for  each  entity.  If  they  share  overlapping  confidence 
intervals,  they  are  connected.  The  connection  is  weighted  based  on  a  scaled 
difference  between  the  means  of  the  confidence  intervals.  The  final  network 
consists  of  nodes  connected  only  across  all  of  the  chosen  intangible  or  thematic 
fields.  Analysis  of  the  network  is  conducted  using  measures  of  degree  centrality. 
This  paper  proposes  a  new  algorithm  for  determining  connection  between  nodes 
in  a  thematic  network,  using  an  analysis  of  radical  jihadist  writings  to  demonstrate 
the  applicability  of  the  method. 

Introduction

A social network is a mathematical quantification of connections between groups or
people. Social networks are based on the idea that rational actors are interdependent beings
(Wasserman et al., 1994). That is, no one person or event can exist in a vacuum. From each
person and each event there must invariably be ties or links to other people and events. Those
ties can be pictorially depicted as connections (relationships) in a network where the people or
events are nodes. Behind each graphically displayed network is a matrix of connections: each
node in the chart receives a row and column, and the relationship between two nodes (either
binary or valued) is indicated in the corresponding intersection in the matrix. From such charts,
and the matrices from which they are developed, network analysts can determine central actors
or key events. They use known relationships to gain information or to exploit other members of
the network. For example, a massive, hand-drawn and catalogued social network led U.S.
forces to Saddam Hussein.

As we have seen on the battlegrounds of the Global War on Terror (GWOT), the focus of
modern war is not kinetic. Instead, commanders at all levels must understand the “human
terrain” of their area of responsibility. Much of the information we can use to understand local
culture, history, opinion and leadership is in text form: newspapers, memoranda, religious
declarations, letters and a multitude of other sources. Reading all of the available information is
time consuming. Finding connections and drawing conclusions from the texts is even more
challenging. Social network analysis is a solution, but the current methods do not provide a way
to compare key personnel, groups or locations across multiple fields. Analysts are limited to
looking at only one area, such as citation analysis, where connections are made when authors
cite each other. This paper will explain a method of finding connections across multiple fields.

Literature  Review 

A new algorithm for constructing a social network across multiple fields will be
demonstrated on an example data set of radical Islamist writers. The data set is composed of
approximately 250 translated texts provided by the Combating Terrorism Center at West Point
(CTC). From the 250 texts, 92 were chosen that were published by a selected set of fifteen
authors. The authors were chosen based on two criteria: having more than two texts in the
original data set, and not being well known. This means that Bin Laden and Zarqawi, among
others, were not analyzed because conclusions reached about them were not likely to be new or
useful to the CTC. The fifteen authors are (in alphabetical order by first letter of first word):
Abd-al-Aziz al-Muqrin, Abu-Maysarah al-Iraqi, Muhammad Alshareef, Muhammad Naasirud
Deen al-Albanee, Shaykh Abdul Qadir Bin Abdul Aziz, Shaykh 'Abdul-'Azeez Ibn Baaz, Shaykh
Abdullah Azzam, Shaykh Abu Basir At-Tartusi, Shaykh Abu Mohammed al-Maqdisi, Shaykh
Hammoud bin Uqlaa' Ash-Shuaibi, Shaykh Naasir ibn Hamad al-Fahd, Shaykh Rabee' ibn
Haadee al-Madkhalee, Sheik Muhammad Ibrahim Al-Madhi, Sheikh Salman al-Awdah, and
Sheikh Yussef Al-Qaradhawi. The texts were obtained mainly from the Foreign Broadcast
Information Service and the Middle East Media Research Institute, with a few obtained from
various websites. After cataloging all of the administrative data about each text and cleansing the
texts to account for varied spellings and proper nouns, they were analyzed using Crawdad, a text
analysis software package developed at Arizona State University to perform centering resonance
analysis (Corman et al., 2005). The software assigns an influence value to each word in the text.
The CTC developed themes to look for in the texts; each theme is composed of a list of words
that relate to that theme. The eight themes are shown below in Table 1.


TABLE  1  ABOUT  HERE 

Each text was assigned a score for each theme, where the score consisted of the sum of the
influence values of each word in the theme. Once this process was complete, the data were
compiled so that each author had a list of theme scores: one for each theme from each text. Using
basic descriptive statistics, the following theme rankings were produced; these rankings will be
compared to the results obtained using the themed network algorithm later in the paper. Table 2
shows theme ranks for the fifteen authors in each of the eight themes, as well as an average
theme rank.
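
The scoring step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Crawdad's implementation: the influence values, the theme word list, and the function name `theme_score` are hypothetical, and a theme word that does not occur in a text is assumed to contribute zero.

```python
from statistics import mean


def theme_score(influence, theme_words):
    """Sum the influence values of the theme's words found in one text.

    Assumption: a theme word that does not occur in the text contributes 0.
    """
    return sum(influence.get(word, 0.0) for word in theme_words)


# Hypothetical influence values (word -> influence) for two texts by one author.
text_1 = {"infidel": 0.12, "occupation": 0.05, "sheikh": 0.08}
text_2 = {"infidel": 0.03, "apostate": 0.10}

# Hypothetical word list for a single theme.
theme = ["infidel", "apostate", "occupation"]

# One theme score per text, then the author's mean score for the theme.
scores = [theme_score(text_1, theme), theme_score(text_2, theme)]
author_mean = mean(scores)
```

Ranking the authors on a theme then amounts to sorting them by `author_mean` in descending order.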


TABLE  2  ABOUT  HERE 

A theme rank of 1 indicates that the author had the highest mean theme score for the given
theme, while a rank of 15 indicates the lowest mean score. The “Average Rank” column is
simply the average of the eight theme ranks for each author, while the “Overall” column ranks
the authors by that average. Al-Fahd has the best average rank (taking 1 as the highest rank),
and is therefore the overall number 1 author.

Further analysis of the data set was completed using a plagiarism check. This software
program checks texts for matching word strings of five words or more. Any two authors with
four or more connections were drawn as connected nodes in a network. The network is shown in
Figure 1.
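
A word-string check of this kind might look like the following sketch. The source does not name the software or say exactly how connections were counted, so everything here is an assumption: a connection is taken to be one distinct five-word window shared by two texts, and the function names and the `threshold` parameter are hypothetical.

```python
def word_ngrams(text, n=5):
    """Return the set of distinct n-word windows in a text (lower-cased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_strings(text_a, text_b, n=5):
    """Count the distinct n-word strings that occur in both texts."""
    return len(word_ngrams(text_a, n) & word_ngrams(text_b, n))


def connected(text_a, text_b, n=5, threshold=4):
    """Assumed rule: draw a tie when at least `threshold` strings match."""
    return shared_strings(text_a, text_b, n) >= threshold


# Two toy texts that share four five-word windows.
a = "one two three four five six seven eight nine"
b = "zero one two three four five six seven eight"
tie = connected(a, b)
```

Because longer texts contain more windows, the number of matches tends to grow with text length.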


FIGURE  1  ABOUT  HERE 


This method has inherent flaws. First, it does not account for context of the words in the text: it
is possible to say the same thing as another person and to mean something entirely different.
Second, it oversimplifies by assigning word matches to rote statements such as the Koranic
intonations that begin many of the texts. Finally, and most importantly, the plagiarism check is
weighted very heavily towards prolific authors. Al-Awdah is the most central person in this
network, meaning that he has the most connections, and he is also the author with the most
writing (in terms of length, not number of texts). Maqdisi is the next most central, and he wrote
the second-largest amount. In fact, there is almost a one-to-one correspondence between the
centrality rank of each author and their rank in terms of how much they wrote. Clearly this
method is flawed in terms of providing meaningful results. The algorithm proposed in this paper
improves upon these current methods of constructing a social network.

Methodology 

Various mathematical procedures have been used to find fissures in the Islamic extremist
ideological movement based on jihadist texts. ANOVA was tried first, and then the
Kruskal-Wallis test, but neither method was capable of producing meaningful analysis. The
newly proposed algorithm was developed as an alternative method for finding connections across
multiple categories using mathematically sound procedures. The algorithm uses t-confidence
intervals and geometric means to construct a network across multiple categories, or fields.
Starting with a set of scores for each node in the future network, we construct confidence
intervals, determine interrelation scores based on the intervals, and construct a matrix. The
resultant network is a representation of the geometric means of each pair of nodes' relationships
in each field.

Given a set of nodes to analyze (radical Islamic authors are one example), one must
create a set of scores for each node. Although there are various ways to do this, the example data
set used in this paper obtained theme scores from Crawdad, a text analysis software program, for
each author and each text. With a set of scores for each node, it is possible to construct
confidence intervals that approximate the node's normal range of values in the specified field.


The  primary  drawback  to  confidence  intervals  is  their  symmetry.  This  means  that  a 
grossly  asymmetric  data  set  will  not  actually  meet  the  stated  level  of  confidence.  For  example, 
given  a  data  set  with  an  arbitrary  left  end-point,  such  as  hourly  wages  in  the  U.S.,  the  data  will  be 
skewed  to  the  right.  Skewness  may  cause  the  confidence  level  to  be  closer  to  85%  or  90%  than  to 
95%,  but  if  the  data  set  has  greater  than  thirty  points,  this  becomes  less  of  a  problem  because  of 
the  central  limit  theorem.  Thus,  a  95%  confidence  interval  should  produce  an  acceptable  range  in 
which  that  node’s  mean  score  for  a  certain  field  is  likely  to  fall.  For  the  example  set  in  this  paper, 
there  may  be  some  bias  because  there  were  between  two  and  eighteen  texts  per  author. 

Using the set of scores for each node, a confidence interval is calculated for each node in
each field. The confidence interval is given by

$$\bar{x} \pm t_{.025,\,n-1} \frac{s}{\sqrt{n}} \tag{1}$$

where $\bar{x}$ is the sample mean of the theme score for a particular author, $t_{.025,\,n-1}$ is
the t-statistic for a 95% confidence interval with $n-1$ degrees of freedom, $s$ is the sample
standard deviation of an author's text scores, and $n$ is the number of texts by an author.
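
As a hedged illustration of Equation (1), the sketch below computes a 95% t-confidence interval using only the Python standard library. The lookup table `T_975` holds standard two-sided 95% critical values for 1 to 17 degrees of freedom, which covers the two to eighteen texts per author reported for the example set; the table name, the function name, and the sample scores are assumptions made for this example.

```python
from math import sqrt
from statistics import mean, stdev

# Standard t critical values t_{.025, df} for df = 1..17 (two-sided 95%).
T_975 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
    7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228, 11: 2.201, 12: 2.179,
    13: 2.160, 14: 2.145, 15: 2.131, 16: 2.120, 17: 2.110,
}


def t_confidence_interval(scores):
    """Return (low, high) for the mean of `scores`, following Equation (1)."""
    n = len(scores)
    if n < 2 or n - 1 not in T_975:
        raise ValueError("need between 2 and 18 scores for this lookup table")
    centre = mean(scores)
    # Half-width: t critical value times the standard error s / sqrt(n).
    half_width = T_975[n - 1] * stdev(scores) / sqrt(n)
    return centre - half_width, centre + half_width


# Hypothetical theme scores from three texts by one author.
low, high = t_confidence_interval([2.0, 4.0, 6.0])
```

With only three texts the critical value is 4.303, so the interval is wide. This is how the method keeps authors with few texts from appearing more precisely located than the data support.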

Once the confidence intervals are constructed, they are used to determine if two nodes are
similar in each field. If two nodes do not have overlapping confidence intervals, then they
receive an interrelation score of 0. If two nodes have overlapping confidence intervals in a field,
then they receive an interrelation score given by

$$s_{i,j} = \frac{MD - AD}{MD} \tag{2}$$

where MD is the maximum possible difference between any two means within the field, AD is
the actual difference between the two nodes' means, and $s_{i,j}$ denotes the matrix entry for the
two nodes ($i$ and $j$). This relationship score is scaled so that a maximum difference of means
will produce zero, while a difference of zero produces a score of one. Scaling is necessary to
ensure that data across all fields can be compared without bias.
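
The overlap test and the scaled score can be sketched as follows. Two assumptions are made that the text does not settle: MD is taken to be the largest observed gap between any two node means in the field, and self-ties on the diagonal are left at zero. The function names and the three-node example data are hypothetical.

```python
def intervals_overlap(ci_a, ci_b):
    """True when two (low, high) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


def field_matrix(names, means, intervals):
    """Build the interrelation matrix for one field using Equation (2)."""
    # Assumption: MD is the largest observed gap between any two node means.
    max_diff = max(means.values()) - min(means.values())
    size = len(names)
    matrix = [[0.0] * size for _ in range(size)]
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i == j or not intervals_overlap(intervals[a], intervals[b]):
                continue  # no self-ties; non-overlapping pairs keep score 0
            if max_diff == 0:
                matrix[i][j] = 1.0  # every mean is identical
            else:
                matrix[i][j] = (max_diff - abs(means[a] - means[b])) / max_diff
    return matrix


# Hypothetical means and confidence intervals for three nodes in one field.
names = ["A", "B", "C"]
means = {"A": 1.0, "B": 2.0, "C": 5.0}
intervals = {"A": (0.0, 2.5), "B": (1.0, 3.0), "C": (4.0, 6.0)}
matrix = field_matrix(names, means, intervals)
```

In this example A and B overlap and differ by 1 out of a maximum of 4, giving (4 - 1) / 4 = 0.75, while C overlaps with neither and scores zero against both.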

Each relationship score is placed in a matrix for each field. Table 3 shows an example
matrix.


TABLE  3  ABOUT  HERE 


Once matrices are constructed for each field, the newly proposed algorithm provides a method to
combine them. The geometric mean of each pair of nodes for each field is taken, and the
resultant number is placed in a new square matrix, again with the nodes as the rows and
columns. The formula for the geometric mean is:

$$\left( \prod_{i=1}^{f} a_i \right)^{1/f}$$

where $f$ is the number of fields and $a_i$ is the relationship score in field $i$ for the two nodes
in question. If any pair of nodes has an interrelation score of zero for any field, the geometric
mean ensures that the combined score is multiplied by zero, thus forcing their overall relationship
score to zero. The reason behind this method is that if two nodes are connected in all themes,
they are much more similar than nodes that are only connected in a few fields. This does leave
open the possibility that two nodes may be connected in all but one field and register an overall
score of zero. However, different risk functions could be investigated for other applications of
themed network analysis.
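
The combination step can be illustrated with a short sketch. It assumes each field matrix is a square list of lists with the nodes in the same order; the function name `combine_fields` and the two small example matrices are hypothetical.

```python
from math import prod


def combine_fields(field_matrices):
    """Entry-wise geometric mean of the per-field interrelation matrices."""
    f = len(field_matrices)       # number of fields
    n = len(field_matrices[0])    # number of nodes
    return [
        [prod(m[i][j] for m in field_matrices) ** (1.0 / f) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical interrelation matrices for two fields and three nodes.
field_1 = [[0.0, 0.64, 0.9],
           [0.64, 0.0, 0.5],
           [0.9, 0.5, 0.0]]
field_2 = [[0.0, 0.25, 0.0],
           [0.25, 0.0, 0.5],
           [0.0, 0.5, 0.0]]

combined = combine_fields([field_1, field_2])
```

The first and third nodes score 0.9 in one field but 0 in the other, so their combined score is 0. This is the behaviour described above, where a single unconnected field removes the tie.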


The  final  result  is  a  square  matrix  containing  the  overall  relationship  scores  of  each  pair 
of  nodes  across  all  measured  fields.  From  this  matrix,  computer  software  is  used  to  construct  the 
graphical  network. 

Results 

The  network  constructed  using  the  newly  proposed  algorithm  differed  from  those 
produced  using  either  the  plagiarism  check  or  the  average  theme  ranks.  The  final  square  matrix 
from  the  new  method  is  displayed  in  Table  4. 

TABLE  4  ABOUT  HERE 

The  matrix  produces  a  more  insightful  network  and  is  shown  in  Figure  2. 

FIGURE  2  ABOUT  HERE 


The newly proposed algorithm, like the average theme ranks, places Al-Fahd as the most central
figure. However, the average theme ranks approach does not account for connections between
the authors. This makes the measure excellent at determining which themes an author focuses
on, but not at determining which author is the most influential within the group. A significant
advantage of the newly proposed algorithm is that it does tell us who is the most influential:
Al-Fahd, with Maqdisi second. The numerical representation of this (based on weighted degree
centrality) is shown in Table 5.
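
Weighted degree centrality can be read directly off the combined matrix as a row sum, as in the sketch below. The node labels and matrix values are hypothetical, and the normalisation behind the NrmDegree column is not specified in the text, so it is not reproduced here.

```python
def weighted_degree(names, matrix):
    """Sum each node's tie weights to every other node (diagonal ignored)."""
    return {
        name: sum(w for j, w in enumerate(matrix[i]) if j != i)
        for i, name in enumerate(names)
    }


# Hypothetical combined interrelation matrix for three nodes.
names = ["A", "B", "C"]
matrix = [[0.0, 0.4, 0.0],
          [0.4, 0.0, 0.7],
          [0.0, 0.7, 0.0]]

degrees = weighted_degree(names, matrix)
# Most central node first.
ranking = sorted(names, key=lambda name: degrees[name], reverse=True)
```

Here B has the largest total tie weight and is therefore ranked as the most central node.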


TABLE  5  ABOUT  HERE 


It can be seen in Table 5 that Al-Fahd is the most influential author in the group. Al-Fahd is a
radical Saudi cleric who was very influential in the jihad movement in the 1990s (Brachman,
2006), which explains his influence in the network. Maqdisi, the second most influential, was
Zarqawi’s mentor when the two were imprisoned together (Al-Zaydi, 2005). Clearly, both men
have been and still are influential ideologues within the jihadist movement. The largest
confirmation of the method, however, is that Ibn Baaz is the third most influential figure - he
was not in the top five using either average theme ranks or the plagiarism check. Ibn Baaz, also
known as Bin Baz, has been described as one of the “founding fathers” of the jihadist movement;
though not an advocate of violence or offensive jihad, his writings laid the theological
groundwork for the current Islamic Fundamentalist jihadists (Brachman, 2006). In any list of
influential Islamic writers, Ibn Baaz would be included, and his inclusion here is reassurance that
the confidence interval method is valid. Also corroborating the proposed algorithm is the
placement of Madkhalee and Al-Albanee, two moderates who advocate peaceful means of
advancing their fundamentalist ideology, as outliers. Within the selected group, they are at odds
with the majority, once again evidencing that the algorithm produced results consistent with
reality.

Conclusion 

Methods of constructing author theme social networks that simply rank the theme scores
of each author ignore the possibility and implication of connections between the authors. A
simple word matching program cannot account for content and is greatly affected by the amount
of data to analyze. Themed analysis, however, using confidence intervals to discern similarity,
can provide an improved measure of connectivity. Furthermore, using confidence intervals
negates the effect of voluminous or scanty writings. By analyzing an example data set of jihadist
authors, the newly proposed algorithm demonstrates its ability to provide analysis of either
abstract or concrete data, to find linkages across several different fields, and to find intangible
connections such as the thematic relationships previously discussed. In short, this algorithm
provides an improved method for finding connections in large amounts of textual data.

Table 1 - Eight Themes For Analysis


[Table 2 body garbled in extraction. Surviving column headers: Author, Theme Ranks (infidel,
foreigners, battlegrounds, sheikh), Overall. Surviving row fragments, column unknown:
al-Iraqi 8, At-Tartusi 14, Abdul Aziz 5, Madhi 2, Qaradhawi 15, Alshareef 3, Al-Awdah 13,
al Albanee 1. Rows for Al-Fahd, Madkhalee and Ibn Baaz survive without values.]

Table 2 - Theme Ranks of Fifteen Jihadist Authors


Table 4 - Overall Theme Interrelation Scores from Proposed Algorithm


Figure  2  -  Network  of  Theme  Interrelations  using  Proposed  Algorithm 


Author         Texts   Weighted Degree   NrmDegree
Al-Fahd            8             9.389      70.053
[Maqdisi]          6             8.549   (missing)
Ibn Baaz          10             8.41       62.745
Shuaibi            3             7.606      56.753
Madhi              4             7.26       54.17
Azzam              4             7.185      53.608
Abdul Aziz         4             5.818      43.41
al-Iraqi           7             5.189      38.714
[al-Muqrin]       10             4.955      36.971
Al-Awdah          16             4.105      30.625
Alshareef          2             3.833      28.602
At Tartusi         2             3.55       26.489
Qaradhawi          7             2.112      15.76
Madkhalee          7             1.858      13.864
al Albanee         2             1.658      12.368

Table  5  -  Weighted  Degree  Centrality  of  Theme  Interrelation  Network 


